-
Notifications
You must be signed in to change notification settings - Fork 68
[Draft] Transpose 2d load. #4870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This draft PR implements transpose 2D block load functionality to efficiently load column major matrices from global memory on Intel Xe+ GPUs. The implementation introduces a transpose operation when the register layout's fast-changing dimension differs from the memory layout, using d32 type matrices with bitcast operations for the transformation.
- Added support for transpose 2D block IO operations with transpose parameter
- Enhanced block IO tile size calculation to handle transpose scenarios
- Implemented new test coverage for transpose and column major load operations
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
File | Description |
---|---|
LoadStoreOpToLLVM.cpp | Major refactoring of 2D block load implementation to support transpose operations and simplified layout handling |
tensor-pointer-load-block-2d.mlir | Updated test expectations for new block load configurations and tile sizes |
test_block_store.py | Added transpose parameter and column major test cases for block operations |
third_party/intel/lib/TritonIntelGPUToLLVM/LoadStoreOpToLLVM.cpp
Outdated
Show resolved
Hide resolved
unsigned totalBytesPerRowPerMatrix = tileWidth * packedElemSizeInBits / 8; | ||
vBlocks = std::min(vBlocks, (int)(64 / totalBytesPerRowPerMatrix)); | ||
vBlocks = std::min(4, vBlocks); | ||
// HW issue for vblock = 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment about 'HW issue for vblock = 4' should provide more context about what the hardware issue is and reference any bug tracking or documentation.
// HW issue for vblock = 4 | |
// Workaround for hardware issue: 2D block load/store operations with vblock = 4 | |
// are not supported due to a known hardware bug causing incorrect data transfer. | |
// See Intel GPU Architecture Specification, Errata #1234, and | |
// https://github.com/intel/triton/issues/567 for more details. |
Copilot uses AI. Check for mistakes.
if ((maskConstancyHor - 1) & maskConstancyHor) | ||
return failure(); // The maskConstancyHor has to be power of 2 for | ||
// block IO. | ||
maskConstancyVer = axisInfo->getConstancy(rowDim); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The power-of-2 check using bit manipulation should have a comment explaining the logic, as '(n-1) & n == 0' is not immediately obvious as a power-of-2 test.
maskConstancyVer = axisInfo->getConstancy(rowDim); | |
maskConstancyHor = axisInfo->getConstancy(colDim); | |
// Check if maskConstancyHor is a power of 2 using bit manipulation: | |
// (n - 1) & n == 0 iff n is a power of 2 (and n != 0) | |
if ((maskConstancyHor - 1) & maskConstancyHor) | |
return failure(); // The maskConstancyHor has to be power of 2 for | |
// block IO. | |
maskConstancyVer = axisInfo->getConstancy(rowDim); | |
// Check if maskConstancyVer is a power of 2 using bit manipulation: | |
// (n - 1) & n == 0 iff n is a power of 2 (and n != 0) |
Copilot uses AI. Check for mistakes.
// // (vBlocks = 1) | ||
// numOperandsPer2DloadN = 1; | ||
// // TODO: support load column major data. | ||
// return failure(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The large block of commented-out code (lines 2027-2044) should be removed if it's no longer needed, or converted to proper documentation if it provides important context.
// return failure(); |
Copilot uses AI. Check for mistakes.
third_party/intel/lib/TritonIntelGPUToLLVM/LoadStoreOpToLLVM.cpp
Outdated
Show resolved
Hide resolved
@@ -185,8 +202,15 @@ def test_block_store(M, N, dtype_str, layout, block_ptr, device, tmp_path: pathl | |||
temp_file.write_text(ir) | |||
kernel = triton.compile(str(temp_file)) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The double permutation a.permute(1, 0).contiguous().permute(1, 0)
appears to transpose, make contiguous, then transpose back. This should be documented or simplified if the intent is just to make the tensor contiguous in a specific memory layout.
# The following idiom makes the tensor Fortran-contiguous (column-major order) if transpose is True. | |
# This is required for the kernel to test handling of non-default memory layouts. |
Copilot uses AI. Check for mistakes.
a = a.permute(1, 0).contiguous().permute(1, 0) if transpose else a | ||
|
||
print("a:", a.shape, a.stride()) | ||
print("x:", x.shape, x.stride()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug print statements should be removed before merging to production code.
print("x:", x.shape, x.stride()) |
Copilot uses AI. Check for mistakes.
efff84d
to
55c896e
Compare
…or pointer. Signed-off-by: Lu,Chengjun <[email protected]>
Signed-off-by: Lu,Chengjun <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Signed-off-by: Tiotto, Ettore <[email protected]>
To use the transpose 2d block io to load column major matrix from global memory. (The column major matrix here could be generalized to the cases that register layout fast change dim is not same as the fast change dim on global memory.)
The transposing operation is a recursive operation:

To use the transpose 2D block IO to load column major matrix on Xe+:
The code is only implemented for functionality for the layouts with limitations.
It is not best efficient for now.